Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify asset path at top level in API #1700

Closed
wants to merge 2 commits into from
Closed

Conversation

jjnesbitt
Copy link
Member

Closes #1687, follow-up to #1689

This is a breaking API change.

Previously, for asset creation/update, the path of the asset was specified within the metadata field. This is confusing, as an asset's path is stored directly on the model itself (not in the metadata), and was the leading cause of #1682. This PR changes that behavior, by requiring an asset's path to be specified directly at the top level, alongside metadata, blob_id, and zarr_id.

Copy link
Member

@waxlamp waxlamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but I had one question about the user-facing interface changes.

Comment on lines -216 to +220
if 'path' not in data['metadata'] or not data['metadata']['path']:
raise serializers.ValidationError({'metadata': 'No path specified in metadata.'})
if 'path' in data['metadata']:
raise serializers.ValidationError(
{'metadata': 'Path must only be specified at top level.'}
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the part that makes this a breaking change?

Related question: it would be reasonable under the general usage of PUT endpoints to copy the metadata from a GET call and then paste it into the PUT call; if someone does that, will the presence of the path entry cause this error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is half of the breaking change yes, the other half is being required to specify the path at the top level (i.e. violating either of these will fail the request).

Related question: it would be reasonable under the general usage of PUT endpoints to copy the metadata from a GET call and then paste it into the PUT call; if someone does that, will the presence of the path entry cause this error?

Yes it will cause this error. If we'd like, we could still allow path in the metadata and just ignore it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we'll disallow path: does this check belong in the serializer or the service layer? Since we would want to prevent it regardless of where it came from, I think it belongs in the service layer. And if that's the case, is path still a necessary element in ASSET_COMPUTED_FIELDS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd still need to keep path as an entry in ASSET_COMPUTED_FIELDS, since it's still returned as a part of the computed metadata (and as such we'd want to prevent storing it in the metadata directly).

The validation error I added here is more a signal externally to prevent confusion / false bug reporting (e.g. "I changed path in the metadata, why does nothing change?"). If we assume we're keeping path as a field in ASSET_COMPUTED_FIELDS, I don't see what the benefit of performing a check in the service layer is, since the field will be stripped out anyway.

What do you think? It's possible I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Jake and I discussed this and settled on generalizing the "path can't be specified as metadata" requirement to all reserved (computed) fields. As it stands, the same bug as #1682 exists for the other fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context computed means "computed from other properties of the asset or asset metadata". In the case of path, since it's stored directly on an asset, it's "computed" by just injecting that field into the metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @AlmightyYakob - i still find it weird that the API requires this as a separate element as opposed to being able to pull it out internally from the metadata as it used to. locally the CLI validates the asset with these fields inside, and hence seems easiest from a programmer standpoint using the API to have one point of entry. since this is a breaking API change, just want to use the opportunity to both reflect that dandi objects are the metadata and ensuring a good pattern for outside clients, which are growing beyond us, interacting with the API.

Copy link
Member Author

@jjnesbitt jjnesbitt Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent of this change is to make it more clear the distinction between modifiable and un-modifiable metadata. The issue with the current state of things is as follows:

You have an asset and it's metadata, and you want to change a field in that asset's metadata. Depending on what field you want to modify, it will actually result in no change when the request is made to the API, since some fields are computed/derived/etc, and some are actually modifiable. Things like contentUrl and digest are examples of fields that you cannot modify directly. They are computed/derived from the asset, or its associated asset blob / zarr.

The path field is a unique example, where it's a field that is computed (since that data is stored on the asset itself and is then injected into the metadata), yet prior to this proposed change, was also specified in the metadata, and then stripped out by the server. This is counter-intuitive behavior, and is the reason I originally intended on barring path from being specified in the metadata with this PR. Dan and I also discussed extending this to all computed fields, so that there is a clear one-way data flow for any computed fields (top level -> metadata).

Leaving things the way they are now, how would a consumer of the API know which fields can be modified and which can't? The only way they could know (aside from reading the source code), is through trial and error, which seems undesirable from a user-facing point of view. If we were to update the API to make it explicit the fields which can and can't be modified, I think that would result in a better experience for consumers of the API.

I do see your point, in the sense that the format of the API and that of the Dandi schema would be diverging somewhat. However, regardless of the uniformity between the two formats, the dichotomy between metadata that can be updated, and computed metadata, still exists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlmightyYakob - sorry for the delay.

The path field is a unique example, where it's a field that is computed (since that data is stored on the asset itself and is then injected into the metadata), yet prior to this proposed change, was also specified in the metadata, and then stripped out by the server.

regarding path, i don't know why it was being stripped out or inserted by the server. path is indeed special as it is the unique key to an asset. i.e. the same path cannot point to multiple blobs.

The only way they could know (aside from reading the source code), is through trial and error, which seems undesirable from a user-facing point of view.

indeed this is the case right now, and is really because of the API service not trusting clients to determine what those fields are. however, the exception is still for path - which is definitely a client prerogative to set, not for the server to decide. the API should have nothing to say about paths except regarding "our" policies on clashes with either POST/PUT.

can we better understand why path should be computed?

the dichotomy between metadata that can be updated, and computed metadata, still exists.

let's figure out a way to add this info to dandischema itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also do not grasp why it is exactly computed, besides may be that it is first extracted from metadata for the asset record and then placed back into it (why?).

@satra
Copy link
Member

satra commented Oct 6, 2023

it may really be helpful to document the lifecycle of an asset somewhere (can point to code pieces as needed). this should include what is injected asset metadata and when.

also if we are making API changes related to the asset, can we also address the blob/zarr vs blob_id/zarr_id difference between API returns and API POST/PUT endpoints and make them consistent.

@jjnesbitt
Copy link
Member Author

it may really be helpful to document the lifecycle of an asset somewhere (can point to code pieces as needed). this should include what is injected asset metadata and when.

Good point, I agree.

also if we are making API changes related to the asset, can we also address the blob/zarr vs blob_id/zarr_id difference between API returns and API POST/PUT endpoints and make them consistent.

You're referring to #1686, correct? I think we can do that, but I'd prefer to do it in a separate PR, to keep things clean. We should delay the merging/deployment of these two PRs until the CLI has the corresponding changes ready to go.

@yarikoptic
Copy link
Member

I feel that this a "good to have" and not really "has to have" change, and that is why not sure if it is worth the API breakage at this point. Should we may be better

  • establish "dandi-api-2.0" (or whatever we want to call it, e.g. dandi-api-0.2 to signify that it is not yet stable despite all the years; or dandi-api-1.0 - as the API we would approach as "finally a stable one to aim") project where we start collecting the issues/PRs which would require API breakage and bundle them all.
  • do get back to documenting "lifecycle" and formalize then how lifecycle would need to change in that "new" one

WDYT?

@jjnesbitt
Copy link
Member Author

I'm closing this PR, and will be making a new PR containing the generalized solution discussed here. That will still constitute a breaking API change, but will be less disruptive, as it'll be contained in one PR.

@jjnesbitt jjnesbitt closed this Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specify asset path at top level in asset update endpoint
5 participants